Skip to content

Replace remaining real-domain test hosts with .test#5189

Merged
pietern merged 2 commits intomainfrom
more-test-domains
May 6, 2026
Merged

Replace remaining real-domain test hosts with .test#5189
pietern merged 2 commits intomainfrom
more-test-domains

Conversation

@pietern
Copy link
Copy Markdown
Contributor

@pietern pietern commented May 6, 2026

Summary

Follow-up to #5125 (Replace bar.com with bar.test in tests). The same footgun lived in many other test fixtures using real-resolving .com domains in Config.Host, databricks.yml's workspace.host, and .databrickscfg: foo.com, test.com, test2.com, myhost.com, myworkspace.com, x.com, a.com, www.host[12].com, other.host.com.

Why this surfaced now

The SDK started calling resolveHostMetadata on every Config init in databricks/databricks-sdk-go#1542 (released in SDK v0.123.0). The CLI pulled this in via #4799 (bump v0.120.0 → v0.126.0). The resolver performs an HTTP fetch against the host's well-known endpoint with a retry loop. For non-resolving hosts (.test, .invalid, etc.) it fails fast. For real domains it depends on the network: a quick TCP RST or 404 also returns fast; a slow handshake or silent drop triggers a 5-minute retry window.

Timeline of task test (linux, direct) job duration:

Date SHA Duration Notable
Apr 16 4909238 497s SDK 0.127.0 bump
Apr 21 f2443de 466s SDK 0.128.0 bump
Apr 28 987c2d9 494s last fast run
Apr 29 4ad6ba3 1077s #5125 (bar.com -> bar.test)
Apr 30 22be781 1205s +10min vs baseline
May 6 9fc1f8d 1178s

The 0.127.0 / 0.128.0 bumps did not move the needle — the resolver was already live since the v0.126.0 bump in #4799. The +10min jump appeared on Apr 29, which lines up with the GitHub Actions runner network changing behavior toward these external domains — most likely a firewall or egress-filtering change that stopped fast-failing the lookups, turning every test that set a real-domain Config.Host into a 5-minute stall.

#5125 fixed the bar.com family and recovered most of the regression. foo.com lived in a separate file with a slightly different naming convention and missed that sweep, leaving TestFilesToSync_Everything* bleeding ~10min combined — the residual still visible above.

What this PR does

  • Mechanical .com.test swap across all remaining test fixtures that set a host. Same low-risk pattern as Replace bar.com with bar.test in tests #5125.
  • Includes the cmd/root/bundle_test.go set (x.com / a.com — both real domains), the schema testdata pass files (myworkspace.com), and the cmd/auth testdata fixture plus its consumers.
  • Adds a rule to AGENTS.md / CLAUDE.md citing RFC 2606 §2 so future test fixtures use a non-resolving TLD by default.

Test plan

  • go build ./... clean
  • Targeted packages pass: bundle/config/validate, bundle/run, libs/auth, libs/cmdctx, libs/template, cmd/auth, cmd/root, bundle/deploy/terraform, bundle/internal/schema
  • Watch task test (linux, direct) duration on this PR — expect recovery to the ~470s baseline

pietern added 2 commits May 6, 2026 11:41
Follow-up to #5125 ("Replace bar.com with bar.test in tests"), which only
swept bar.com. The same footgun lived in many other test fixtures using
real-resolving .com domains in `Config.Host`, `databricks.yml`'s
`workspace.host`, and `.databrickscfg`: foo.com, test.com, test2.com,
myhost.com, myworkspace.com, x.com, a.com, www.host[12].com,
other.host.com.

Why this surfaced now
---------------------
The SDK well-known endpoint resolver (added during an earlier SDK bump)
performs an HTTP fetch with a retry loop the first time
`Config.EnsureResolved` runs. For non-resolving hosts (`.test`,
`.invalid`, etc.) the resolver fails fast. For real domains it depends
on the network: a quick TCP RST or 404 also returns fast; a slow
handshake or silent drop triggers a 5-minute retry window.

Timeline of `task test (linux, direct)` job duration:

| Date   | SHA       | Duration | Notable                       |
|--------|-----------|----------|-------------------------------|
| Apr 16 | 4909238 | 497s     | SDK 0.127.0 bump              |
| Apr 21 | f2443de | 466s     | SDK 0.128.0 bump              |
| Apr 28 | 987c2d9 | 494s     | last fast run                 |
| Apr 29 | 4ad6ba3 | 1077s    | #5125 (bar.com -> bar.test)   |
| Apr 30 | 22be781 | 1205s    | +10min vs baseline            |
| May 6  | 9fc1f8d | 1178s    |                               |

The SDK bumps did not move the needle. The +10min jump appeared on
Apr 29, which lines up with the GitHub Actions runner network changing
behavior toward these external domains - most likely a firewall or
egress-filtering change that stopped fast-failing the lookups, turning
every test that set a real-domain `Config.Host` into a 5-minute stall.

#5125 fixed the bar.com family and recovered most of the regression.
foo.com lived in a separate file with a slightly different naming
convention and missed that sweep, leaving `TestFilesToSync_Everything*`
bleeding ~10min combined - the residual ~10min still visible in the
table above.

What this PR does
-----------------
Mechanical `.com -> .test` swap across all remaining test fixtures
that set a host. Same low-risk pattern as #5125. Includes the
`cmd/root/bundle_test.go` set (x.com / a.com - both real domains),
the schema testdata pass files (myworkspace.com), and the cmd/auth
testdata fixture plus its consumers.

Also adds a rule to AGENTS.md / CLAUDE.md so future test fixtures use
a non-resolving TLD by default and we don't have to relearn this
through a CI slowdown.

Co-authored-by: Isaac
Make the rule self-explaining: .test (and .example/.invalid/.localhost)
are reserved by RFC 2606 §2 specifically so they don't resolve in real
DNS, which is exactly the property we want for test fixture hosts.

Co-authored-by: Isaac
@pietern pietern temporarily deployed to test-trigger-is May 6, 2026 09:48 — with GitHub Actions Inactive
@pietern pietern temporarily deployed to test-trigger-is May 6, 2026 09:48 — with GitHub Actions Inactive
@pietern pietern merged commit 42a9c25 into main May 6, 2026
24 of 25 checks passed
@pietern pietern deleted the more-test-domains branch May 6, 2026 13:13
janniklasrose added a commit that referenced this pull request May 8, 2026
## Changes

Rename bare-hostname test fixtures (no TLD) to use `.test`:

- `libs/databrickscfg/profile/testdata/databrickscfg` —
`https://default`, `https://query/?o=1234`, `https://foo`
- `libs/databrickscfg/profile/testdata/sample-home/.databrickscfg` —
`https://default`
- `libs/databrickscfg/loader_test.go` — `Host:` fields and the `multiple
profiles matched` error-message assertion
- `libs/databrickscfg/ops_test.go` — only the two cases that match the
renamed shared fixture profiles
- `cmd/labs/project/testdata/installed-in-home/.databrickscfg` —
`https://abc`

`https://accounts.cloud.databricks.com` and
`https://spog[-dup].databricks.com` in the same files are intentionally
left as-is.

## Why

Follow-up to #5189. That sweep replaced real `.com` hosts; this catches
the bare-hostname variants the `.com` filter missed. The SDK well-known
endpoint resolver treats single-label hosts as real DNS lookups, so
these fixtures stall ~3s each on the Windows CI runner.

From `test-output.json` of run 25496981743 (win-tf):

| Test | linux | windows |

|----------------------------------------------------------|--------|---------|
| `TestLoaderSkipsExistingAuth` | 0.01s | 3.04s |
| `TestLoaderSkipsExplicitAuthType` | 0.01s | 2.79s |
| `TestLoaderMatchingHostWithQuery` | 0.03s | 2.78s |
| `TestLoaderMatchingHost` | 0.00s | 2.76s |
| `TestLoaderSkipsNonExistingConfigFile` | 0.01s | 2.75s |
| `TestLoaderSkipsNoMatchingHost` | 0.01s | 1.27s |
| `cmd/labs/project: TestRunningCommand` | 0.10s | 2.98s |
| `cmd/labs/project: TestRenderingTable` | 0.03s | 2.92s |
| `cmd/labs/project: TestRunningBlueprintEchoProfileWrong` | 0.01s |
2.76s |

~25s per Windows job × 2 engines = ~50s expected to come back. Scope is
limited to fixtures that actually trigger the resolver. Other
`https://foo`-style strings in `ops_test.go` are in tempdir-only tests
that never hit DNS and remain untouched to keep the diff focused.

## Tests

- `go test ./libs/databrickscfg/... ./cmd/labs/project/...` passes
locally.
- `./task fmt`, `./task checks`, `./task lint` all clean.
- Watch the Windows-tf job duration on this PR — expect ~25–50s
improvement.

_This PR was written by Claude Code._
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants